path assignment
Routing for Large ML Models
Cohen, Ofir, Yallouz, Jose, Schapira, Michael, Belkar, Shahar, Mizrahi, Tal
The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing the manner in which flows are routed across the network. We present an algorithmic framework for quantifying network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically optimizing routing with respect to this global metric. Our aim is to devise methodologies for the online adaptation of routing configurations in ML training clusters that improve global training efficiency and fairness. Our approach builds on two characteristics of ML training and modern networking: Traffic patterns induced by ML training tend to exhibit ...
Winning Solution of the AIcrowd SBB Flatland Challenge 2019-2020
This report describes the main ideas of the solution that won the AIcrowd SBB Flatland Challenge 2019-2020 with a score of 99% (meaning that, on average, 99% of the agents were routed to their destinations within the allotted time steps). The details of the task can be found on the competition's website. The solution consists of two major components:
1) A component that (re-)generates paths over a time-expanded graph for each agent.
2) A component that updates the agent paths after a malfunction occurs, trying to preserve, for each cell, the order in which the agents were going to enter it before the malfunction. The goal of this component is twofold: a) to (try to) avoid deadlocks, and b) to bring the system back to a consistent state (where each agent has a feasible path over the time-expanded graph).
I discuss both of these components below, along with a series of potentially promising but unexplored ideas.
The invariant for the first (path-generation) component is that every agent always has an assigned path (specifying where it will be located at each time step over the whole time horizon); the component only tries to improve the overall path assignment. Initially, every agent is assigned a default path that never enters the environment (the agent simply stays at its initial location, outside the environment).
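The idea of planning over a time-expanded graph while keeping the "every agent always has a path" invariant can be illustrated with a minimal sketch. This is not the winning solution's code: the names `plan_path`, `reserve`, `default_path`, and `OUTSIDE`, the adjacency-dict input, and the use of plain breadth-first search are all assumptions made for illustration. The sketch also ignores swap (edge) conflicts, agent speeds, and the direction constraints of real Flatland rail cells, and it assumes an agent leaves the environment as soon as it reaches its goal.

```python
from collections import deque

OUTSIDE = None  # hypothetical marker: the agent has not entered the environment


def default_path(horizon):
    """Default assignment: stay outside the environment for the whole horizon."""
    return [OUTSIDE] * (horizon + 1)


def plan_path(neighbors, start, goal, horizon, reserved):
    """Breadth-first search over (cell, time) states of a time-expanded graph.

    neighbors: dict mapping each cell to its adjacent cells (assumed input format)
    reserved:  set of (cell, time) pairs already claimed by other agents
    Returns the positions for t = 0..arrival, or None if the goal cannot be
    reached within the horizon.
    """
    root = (OUTSIDE, 0)
    parent = {root: None}
    queue = deque([root])
    while queue:
        pos, t = queue.popleft()
        if pos == goal:
            # Walk the parent links back to the root to recover the path.
            path, node = [], (pos, t)
            while node is not None:
                path.append(node[0])
                node = parent[node]
            return path[::-1]
        if t == horizon:
            continue
        if pos is OUTSIDE:
            candidates = [OUTSIDE, start]          # keep waiting, or enter
        else:
            candidates = [pos] + list(neighbors[pos])  # wait, or move
        for nxt in candidates:
            state = (nxt, t + 1)
            if state in parent:
                continue
            if nxt is not OUTSIDE and state in reserved:
                continue  # another agent occupies this cell at this time step
            parent[state] = (pos, t)
            queue.append(state)
    return None


def reserve(path, reserved):
    """Claim every (cell, time) pair that a path occupies."""
    for t, cell in enumerate(path):
        if cell is not OUTSIDE:
            reserved.add((cell, t))


# Two agents share the corridor A - B - C and both travel from A to C.
corridor = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
reserved = set()
first = plan_path(corridor, "A", "C", 10, reserved)
reserve(first, reserved)
second = plan_path(corridor, "A", "C", 10, reserved)
print(first)   # [None, 'A', 'B', 'C']
print(second)  # [None, None, 'A', 'B', 'C']
```

When `plan_path` returns `None`, a caller could simply keep the agent's previous path (initially `default_path(horizon)`), so the invariant holds whether or not an improvement is found.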